Discourse Connective - A Marker for Identifying Featured Articles in Biological Wikipedia

Authors

  • Sindhuja Gopalan
  • Paolo Rosso
  • Sobha Lalitha Devi
Abstract

Wikipedia is a free-content Internet encyclopedia that can be edited by anyone who accesses it. As a result, Wikipedia contains both featured and non-featured articles: featured articles are of high quality, while non-featured articles are of poorer quality. Because the number of Wikipedia articles is growing exponentially, identifying featured articles has become indispensable for providing quality information to users. As very few attempts have been made in the biology domain of English Wikipedia, we present a study that automatically measures information quality in biological Wikipedia articles. Since coherence reflects the representational information quality of a text, we use the discourse connective count measure. We compare this novel measure with two other popular approaches, the word count measure and the explicit document model method, both of which have been successfully applied to quality measurement in Wikipedia articles. We organized the Wikipedia articles into a balanced set and an unbalanced set. The balanced set contains featured and non-featured articles of equal length, and the unbalanced set contains randomly selected featured and non-featured articles. The best result for the balanced set, an F-measure of 83.2%, is obtained with a Support Vector Machine classifier using a 4-gram representation and the Term Frequency-Inverse Document Frequency weighting scheme. The best result for the unbalanced set, an F-measure of 98.06%, is obtained with the discourse connective count measure.
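
To make the two count-based measures named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the connective list, the function names, and the threshold-based decision rule are all assumptions made for illustration, since the abstract specifies neither the connective inventory nor how the counts are turned into a featured/non-featured decision.

```python
import re

# Hypothetical, deliberately tiny connective inventory (an assumption);
# the abstract does not say which connectives the study counts.
CONNECTIVES = ("because", "however", "although", "therefore",
               "as a result", "in addition")


def count_connectives(text, connectives=CONNECTIVES):
    """Count whole-word, case-insensitive occurrences of the connectives."""
    lowered = text.lower()
    total = 0
    for conn in connectives:
        # \b keeps "because" from matching inside a longer word.
        pattern = r"\b" + re.escape(conn) + r"\b"
        total += len(re.findall(pattern, lowered))
    return total


def word_count(text):
    """Word count measure: number of word tokens in the text."""
    return len(re.findall(r"\w+", text))


def looks_featured(text, threshold):
    """Assumed decision rule: flag an article whose connective count
    reaches a threshold. The threshold value is hypothetical."""
    return count_connectives(text) >= threshold


sample = ("Cells divide because they grow. However, growth stops. "
          "As a result, the tissue stabilizes.")
print(count_connectives(sample), word_count(sample))
```

In this sketch a longer article tends to score higher on both counts, which is consistent with the abstract's choice to also evaluate on a balanced set of equal-length articles.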

Similar resources

Participation and Scientific Collaboration in Persian Wikipedia

Background and Aim: This research studies effective participation and scientific collaboration in Persian Wikipedia from 2003 to 2012. Method: The library method has been used. Also, considering the objectives and the nature of the subject, the research method is descriptive-applied, and scientometric techniques were used in its implementation. Excel and SPSS software have been used for...

A Quantitative Examination of the Impact of Featured Articles in Wikipedia

This paper presents a quantitative examination of the impact of the presentation of featured articles as quality content in the main page of several Wikipedia editions. Moreover, the paper also presents the analysis performed to determine the number of visits received by the articles promoted to the featured status. We have analyzed the visits not only in the month when articles awarded the pro...

Interlingual Aspects Of Wikipedia's Quality

This paper presents interim results of an ongoing project on quality issues concerning Wikipedia. One focus of research is the relation of language and quality measurement. The other one is the use of interlingual relations for quality assessment and improvement. The study is based on mono- and multilingual samples of featured and non-featured Wikipedia articles in English, French, German, and It...

A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles

In this paper, we present a study of the collaborative writing process in Wikipedia. Our work is based on a corpus of 1,995 edits obtained from 891 article revisions in the English Wikipedia. We propose a 21-category classification scheme for edits based on Faigley and Witte’s (1981) model. Example edit categories include spelling error corrections and vandalism. In a manual multi-label annotat...

Weasels, Hedges and Peacocks: Discourse-level Uncertainty in Wikipedia Articles

Uncertainty is an important linguistic phenomenon that is relevant in many areas of language processing. While earlier research mostly concentrated on the semantic aspects of uncertainty, here we focus on discourse- and pragmatics-related aspects of uncertainty. We present a classification of such linguistic phenomena and introduce a corpus of Wikipedia articles in which the presented types of dis...

Journal:
  • Research in Computing Science

Volume 117

Publication date: 2016